Variational method for estimating the rate of convergence of 
Markov Chain Monte Carlo algorithms 

Fergal P. Casejt 
Complex and Adaptive Systems Laboratory, 
University College Dublin, Dublin 4, Ireland 

Joshua J. Waterfall 

Department of Molecular Biology and Genetics, 
Cornell University, Ithaca, NY 14853, USA 

Ryan N. Gutenkunst 

Department of Biological Statistics and Computational Biology, 
Cornell University, Ithaca, NY 14853, USA 

Christopher R. Myers 
Computational Biology Service Unit, 
Life Sciences Core Laboratories Center, 
Cornell University, Ithaca, NY 14853, USA 

James P. Sethna 
Laboratory of Atomic and Solid State Physics, 
Cornell University, Ithaca, NY 14853, USA 
(Dated: July 16, 2008) 



1 



Abstract 

We demonstrate the use of a variational method to determine a quantitative lower bound on 
the rate of convergence of Markov Chain Monte Carlo (MCMC) algorithms as a function of the 
target density and proposal density. The bound relies on approximating the second largest eigen- 
value in the spectrum of the MCMC operator using a variational principle and the approach is 
applicable to problems with continuous state spaces. We apply the method to one dimensional ex- 
amples with Gaussian and quartic target densities, and we contrast the performance of the Random 
Walk Metropolis-Hastings (RWMH) algorithm with a "smart" variant that incorporates gradient 
information into the trial moves, a generalization of the Metropolis Adjusted Langevin Algorithm 
(MALA). We find that the variational method agrees quite closely with numerical simulations. We 
also see that the smart MCMC algorithm often fails to converge geometrically in the tails of the 
target density except in the simplest case we examine, and even then care must be taken to choose 
the appropriate scaling of the deterministic and random parts of the proposed moves. Again, this 
calls into question the utility of smart MCMC in more complex problems. Finally, we apply the 
same method to approximate the rate of convergence in multidimensional Gaussian problems with 
and without importance sampling. There we demonstrate the necessity of importance sampling 
for target densities which depend on variables with a wide range of scales. 

PACS numbers: 05.10.Ln, 02.70.Tt, 02.50.Ng, 02.70.Rr 
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I. INTRODUCTION 



Markov Chain Monte Carlo (MCMC) methods are important tools in parametric mod- 
eling; where the goal is to determine a posterior distribution of parameters given a 
particular dataset. Since these algorithms tend to be computationally intensive, the chal- 
lenge is to produce algorithms that have better convergence rates and are therefore more 



efficient 



Of particular concern are situations where there is a large range of scales as- 



sociated with the target density, which we find are widespread in models from many different 



fields \5 



In this manuscript we quantify the convergence of the MCMC method by the second 
largest eigenvalue in absolute value for the associated operator in L^. This is not the 
only numerical quantity that can be used to describe the convergence properties. Other 
authors quantify convergence with different metrics: computing the constant of geometric 
convergence with respect to the total variation norm [l^, monitoring sample averages 
evaluating mixing of parallel chains 12|] or looking at the integrated autocorrelation time of 



functions of the sample 
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14 [.The connection between the second eigenvalue and total 
15|. To connect the second eigenvalue estimates to metrics 



variation norm is discussed in 
based on autocorrelation, we would argue informally that the second eigenvalue determines 
the autocorrelation time of the slowest mixing function of the sample and as such represents 
a "worst" case for the length of time you would need to run the chain to reduce the variance 
of sample averages to a predefined level. 

There are a number of techniques to either determine exactly or bound the second eigen- 
value or the constant of geometric convergence for MCMC algorithms on discrete state 
spaces but the methods for finding quantitative bounds for continuous state 

spaces require a more technical formulation. Where work has been done in that area, up- 



Id, 



2l| or 



per bounds on the convergence rate can be derived using purely analytical 
semi-analytical techniques 22], but may not always be very useful for selecting parameters 
optimally. Therefore, in this work, we show that a conceptually straightforward variational 
method can provide convergence rate estimates for continuous state space applications. In 
contrast to earlier closely related work [3, [3] , we move away from mathematical formalities, 
focussing from the start on specific examples and step through the calculations that provide 
the second eigenvalue bounds. In Roberts and Rosenthal ISj, rules of thumb are provided 



for determining the optimal acceptance rate and step lengths for both the Random Walk 
Metropolis-Hastings and the Metropolis Adjusted Langevin Algorithm, in the asymptotic 
limit of infinite dimensions where it can be proved that those methods are approximated 
by diffusion processes. The rules are widely used as they are independent of the specific 
form of the target density, appear from numerical simulations to be appropriate far from the 
infinite dimensional asymptotic limit and are easily implemented. In contrast, the approach 
proposed here is to establish convergence properties for particular MCMC algorithms based 
on their performance on simple target distributions without the need to set up a diffusion 
approximation in an infinite dimensional limit. Poor performance or lack of convergence on 
these simple distributions then indicates that further application with more complex tar- 
get densities will also suffer from convergence problems. Conversely, an identification of a 
range of parameters which provide good convergence properties for simple target distribu- 
tions may be used as a starting point for further applications. Even though we provide only 
lower bounds on the second eigenvalue we show these bounds can be remarkably tight due 
to careful choice of test functions, and computing the approximate convergence rate as a 
function of algorithm parameters allows us to optimally tune those parameters. 

We have been able to obtain explicit formulas for one dimensional example problems but 
the method may be more generally applicable, when applied in an approximate way, as we 
demonstrate for a multidimensional problem. 

II. MARKOV CHAIN MONTE CARLO 

Typically, one wishes to obtain a sample xi,X2... from a probability distribution tt{x) 
which is sometimes called the target distribution. An MCMC algorithm works by creating 
a Markov chain that has 7r(x) as its unique stationary distribution, i.e. after many steps of 
the chain any initial distribution converges to tt{x). A sufficient condition to establish tt{x) 
as the stationary distribution is that the chain be ergodic and that the transition density, 
t{x,y), of the chain satisfy detailed balance: 



7r{x) t{x,y) = n{y)t{y,x). 



Given a proposal density q{x, y) for generating moves, one way to construct the required 



transition density 
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24| is to define t{x, y) = a{x, y) q{x, y) where 



a[x,y)=min[— ^^^'^ (1) 



^q{x,y)n{x) 

is the acceptance probability of the step x ^ y. Obtaining the sample from the station- 
ary distribution then involves letting the chain run past the transient (burn-in) time and 
recording iterates from the late time trajectory at time intervals exceeding the correlation 
time. How long it takes to reach the stationary distribution determines the efficiency of the 
algorithm and for a given target distribution, clearly it depends on the choice of the proposal 
density. We can write down the one-step evolution of a probability density p{x) as a linear 
operator: 



{^P){y) = f t{x,y)p{x)dx + (^1 - j t{y,x)dxj p{y) 
{t{x, y)p{x) - t{y, x)p{y)) dx + p{y) 



where dx = dxi . . . dxn, dy = dyi . . . dyn, n is the dimension of the state space and all 
integrals are from — oo to oo here and elsewhere in this manuscript. The second form makes 
it explicit that p{y) = n{y) is the stationary distribution by the detailed balance relation. 

Now, if the linear operator has a discrete set of eigenfunctions and eigenvalues, it holds 
that the asymptotic convergence rate is determined by the second largest eigenvalue in 



absolute value (the largest being one) 25|, l26|]. We will write this eigenvalue as A*, and will 



refer to it as the second eigenvalue meaning the second largest in absolute value. Geometric 



convergence of the chain is ensured when A* < 1 15|], and then the discrepancy between the 
density at the w}^ iterate of the chain and the target density decreases as (A*)*" for large 
m. Many previous authors have taken this second eigenvalue approach, in both the finite 
and continuous state space settings Q, Q, 22, 27|, as it provides a useful quantifier for 



the convergence rate. Ideally we would like algorithm parameters to be adjusted such that 
A* is as as small as possible. 

The variational calculation allows us to obtain an estimate for A*, but before we can 
do this we need to convert our operator into a self-adjoint form which ensures that the 
eigenfunctions are orthogonal. This is easily accomplished by a standard technique [l^ 
of modifying the transition density by s{x,y) = y)A/vr(x)/y^7r(y) and our self-adjoint 



operator is then given by 

{<Sp){y) = [ s{x,y)p{x)dx+ (^1- j t{y,x)dx^ p{y) (2) 
{six, y)p{x) - t{y, x)p{y)) dx + p{y) (3) 



where the "diagonal" part of the old operator (multiplying p{y)) need not be transformed 
using s{x,y). It is easy to show that defined as above, S is self-adjoint using the standard 
inner product in with respect to Lebesgue measure. Note that if u{x) is an eigenfunction 
of the operator S, then -\/ 7r{x)u{x) is an eigenfunction of the original operator C with the 
same eigenvalue. 

A. Metropolis-Hastings and smart Monte Carlo 

We consider two MCMC algorithms which essentially differ only in the choice of proposal 
density and acceptance probability that is used in selecting steps. We will refer to the 
Random Walk Metropolis-Hastings (RWMH) algorithm as that which uses a symmetric 
proposal density to determine the next move; for example, a Gaussian centered at the 
current point: 



q{x,y) = y^\L\/{27r) exp (— — x)'^L{y — x)/2) where L is an inverse covariance matrix 
that needs to be chosen appropriately for the given problem [importance sampling). In 
other words, the proposed move from x to y is given hj y = x + R where R ~ N(0, L~^) is a 
normal random variable, mean and covariance L~^. Thus the update on the current state 
has no deterministic component. We will see that when the target density is not spherically 
symmetric, a naive implementation of the Metropolis-Hastings algorithm where the step 
scales are all chosen to be equal leads to very poor performance of the algorithm. As would 
be expected the convergence deteriorates as a function of the ratio of the true scales of the 
target density to the scale chosen for the proposal density. 



One variant used to accelerate the standard algorithm is a smart Monte Carlo method 28 1 
that uses the gradient of the negative of the log target density at every step, G{x) = 
— Vlog(7r(x)) to give 



q{x,y) = ^exp (^—{y - {x - H-^G{x)fL{y - {x - H-'G{x))^ (4) 



/27r 



and H can be considered either as a constant scaling of the gradient part of the step or, 



if it is the Hessian of — log(7r(x)), as producing a Newton-hke optimization step 29|. The 
move to y is generated as y = x — H^^G{x) + i?, so now we have a random component 
R ~ N(0,L~^) and a deterministic component —H^^G{x). Viewed hke this, moves can be 
considered to be steps in an optimization algorithm (moving to maximize the probability of 
the target density) with random noise added. We will see that with an optimal choice of 
H and for Gaussian target densities, the smart Monte Carlo method can converge in one 
step to the stationary distribution. We will also see that for a one dimensional non-Gaussian 
distribution it actually fails to converge, independent of the values of the scaling parameters. 

B. Variational method 

Once we have the self-adjoint operator for the chain, S from Eqn. [3l and we know 



the eigenfunction with eigenvalue Ai = 1, ^/^l{x), we can look for a candidate second 
eigenfunction in the function space orthogonal to the first eigenfunction where the inner 
product is defined by (^1,^2) = / Pi{x)p2{x) dx. Given a family of normalized candidate 
functions in this space, Va{x), with variational parameter a, the variational principle 0, Q 

states 

miXXa\{Va,SVa)\< \* <l (5) 

and depending on how accurately our family of candidate functions captures the true second 
eigenfunction, this can give quite a close approximation to the second dominant eigenvalue. 
In the problems we examine in the following sections the target densities have an even 
symmetry which makes it straightforward to select a variational trial function: any function 
with odd symmetry will naturally lie in the orthogonal space. For more complicated problems 
with known symmetries this general principle may be useful in selecting variational families 
for the purposes of algorithm comparison. Another approach to constructing the variational 
family is shown in the section on multidimensional target densities: choose the test function 
as a linear combination of two functions, one with the properties that are required (i.e. 
slow convergence to the target distribution) and then the additional term is used merely to 
preserve orthogonality. 

This variational method for providing a lower bound to the second eigenvalue of the 



MCMC algorithm was foreshadowed by a similar approach of Lawler and Sokal 25|. 



These authors considered the flow of probabihty out of a subset A of the state space; 
in our language, their test functions were conflned to the family va{x) = {7r{A'^)xA — 
7r(A)x^c)/(7r(A)7r(A'^)), where xa is the indicator function on the set A. By allowing for 
more general test functions, we establish not only rigorous but also relatively tight bounds 
on convergence rates, providing guidance for parameter optimization and algorithm com- 
parisons. 

Writing out explicitly for S in {va,Sva) we have 

{va,Sva) = j j Va{x)s{x,y)va{y)dxdy - j j t{y,x){va{y)f dxdy + 1 . (6) 

As we will see in the following section, the lower bound in Eqn. [S] can be arbitrarily close 
to 1 and therefore equality holds. In these situations we can also show that the chain 
does not converge geometrically, based on the total variation norm deflnition of geometric 
convergence 10|. However, whether the type of convergence changes or not, we still refer to 



the magnitude of the second eigenvalue estimate in determining efficiency of the algorithm. 
The rationale is that the second eigenvalue determines the longest possible autocorrelation 
time of a function of the MCMC sample; the worst case autocorrelation time will be of 
the order l.{]/log{\*) which could be extremely long. We will also see that there can be 
eigenvalues in the spectrum that are close to — 1 which determine the asymptotic convergence 
rate, i.e. A* = |A„| where A„ < 0. Interestingly, for this situation there is oscillatory behavior 
of the Markov chain. 



III. EXAMPLES 



A. Gaussian target density 

Consider the simplest case of a one dimensional Gaussian target density 7r(x) = 
^/k/(2Ti) exp(-A;xV2) with variance 1/k. Under the RWMH algorithm, the proposal density 
is 

(i{^^y) = \j^^^v(^-\Ky-'^f^ ■ (7) 

The issue is to determine / optimally; a first guess would be that I = k is the best choice. 
The rationale behind this is that since the target and proposal densities have the same form, 
if they also have the same scales, then the convergence rate might be expected to be optimal. 
We will see that this is not actually correct. 



To begin, define a variational function Va{x) oc xexp(— ortliogonal to tlie target 
density and normalized sucli tliat J v^dx = 1. We can motivate tliis clioice by recognizing 
that any initial distribution tliat is asymmetric will most likely have a component of this 
test function, and a convergence rate estimate based on it roughly corresponds to how fast 
probability "equilibrates" between the tails. More commonly, variational calculations will 
use linear combinations of many basis functions with the coefficients as variational parame- 
ters. We find here that including higher order terms in the test function is unnecessary as 
we obtain tight enough bounds just retaining the lowest order term. 

We proceed by evaluating Eqn. [6] noting that because of the form of the acceptance prob- 
ability, Eqn. [H there are two functional forms for the kernels t{x, y) and s{x, y) depending on 
the sign of — x^, i.e. whether the "energy" change, AE{x, y) = — log(7r(?/)) + log(7r(x)) = 
k{y'^ — a;^)/2, is positive or negative. It is then convenient to use the coordinate transforma- 
tion y = rx, X = X OT X = ry, y = y where — 1 < r < 1 and — oo < x,y < oo to evaluate the 
integrals. An explicit expression for {va,Sva) can be obtained for this case of a Gaussian 
target density. 

Next, we use a numerical optimization method to maximize the bound defined by Eqn. 
[5] with respect to a. The result of this analysis is shown in Fig. [1] along with an empirically 
determined convergence rate for comparison. (To obtain the rate empirically, we run the 
MCMC algorithm for many iterates on a random initial distribution and observe the point- 
wise differences from the distribution of the m^^ iterate and the target distribution for large 
m. These differences are either fit using Hermite polynomial functions or by looking for the 
multiplicative factor which describes the geometric decay of the m*'^ difference from one it- 
erate to the next.) The variational bound tightly matches the empirical obtained eigenvalue 
estimates in this case, and an optimum step size / can be ascertained. Clearly our I = 1 
initial guess for the best scaling is far from optimal. 

It is also worth comparing the optimal step scale with those obtained from different meth- 

nn 

ods. In p^, IM], a derivation of the optimal step size and acceptance rate is proposed based 
on minimizing the integrated autocorrelation time of an arbitrary function of the chain's 
states in stationarity. By approximating the chain as an infinite dimensional diffusion pro- 
cess, formulas are derived for the optimal scaling of steps. For our one dimensional Gaussian 
target density, the proposal density's optimal variance is suggested to be (1/2.38)^ = 0.176 
which is surprisingly close to the estimate we have obtained using the variational method 
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FIG. 1: Variational estimate on the second eigenvalue for the one dimensional Gaussian problem 
using the RWMH method, with k = 1.0. The variational estimate is the solid line and the em- 
pirically determined values are marked with stars. Some of the empirical values seem to be less 
than the lower bound, but this is due to inaccuracies in their estimation. The optimum occurs at 
/ w 0.165. 



from Fig. [T], / ^ 0.165. However, given the infinite dimensional limit in which the former 
approximation is made, and the different convergence criterion based on autocorrelation 
time rather than second eigenvalue, the agreement may be merely coincidental. 

Moving to the one dimensional smart Monte Carlo, we have a Gaussian proposal density 
of the form : 

y) = exp (^-h (^y-{x- ^x)^ j (8) 

where 1// is the variance of the random part of the step and l/Zi is the scale of the deter- 
ministic part. (Letting /i — > oo we recover the RWMH algorithm of Eqn. [71) 

Taking h = k corresponds to performing a Newton step at every iterate of the algorithm. 
Thus, since the log of the target density is purely quadratic, the current point will always 
be returned to the extremum at by the deterministic component of the smart Monte 
Carlo step and the random component will give a combined move drawn from q{x, y) = 
q{y) = -\///(27r) exp (— which has the form of an independence sampler 2^|. If we 
then also choose I = k, we see immediately that we are generating moves from the target 



distribution from the beginning, i.e. we have convergence in one step starting from any 
initial distribution. 

In real problems, however, — log(7r(x)) will not be quadratic. We may obtain an estimate 
for / and h by considering its quadratic approximation or curvature but in many cases those 
estimates will have to be adjusted. If the curvature is very small (or in multidimensional 
problems if the quadratic approximations are close to singular), the parameters will have to 
be increased to provide a step size control to prevent wildly unconstrained moves (analogous 
to the application of a trust region in optimization methods 3]). If the curvature is large 
but we believe that the target density is multimodal, we need to decrease the parameters to 
allow larger steps to escape the local extrema. Therefore we examine in the following the 
dependence of the convergence rate as we vary both of the parameters / and h. 

The acceptance probability Eqn. [T]has two functional forms separated by a boundary in 
the (x, y) plane given by 

k + 4 (-2 + ) iy' - = Kk, h, W -x') = . (9) 



h\ h_ 

In particular, the acceptance probability is 

a{x,y) =min(^exp (^-^b{k,h,l){y'^ - x^)^ ,1^ . (10) 

Now we have a complication over the RWMH method because depending on the sign of 
the coefficient function b{k,h,l) in Eqn. [9l we find that either a{x,y) < 1 on \y\ > \x\, 
a{x,y) = 1 on \y\ < \x\ or vice versa. This is shown in Fig. [2l 

As before, for a given value of h and /, we need to break up the double integrals of the 
scalar product {va,Sva), Eqn. El into the appropriate regions. Our choice of variational 
function is the same as before (since the target density is the same) and we again can get 
an explicit (but complicated) expression for Eqn. [6] which we maximize with respect to a. 
The results of this analysis are shown in Fig. [3] (a), where we fix /c = 1.0 and vary h, I. We 
have confirmed that these lower bounds are quite accurate as shown in Fig. [3] (b). 

The remarkable feature of these results is that even for this simple Gaussian problem, 
the selection of step scale parameters h, I is critical to achieve convergence. As already 
mentioned, there is a trivial choice of optimum with h = I = k = 1 that gives one step 
convergence from any initial distribution (and therefore A* = 0). However, if we change 
parameters infinitesimally such that l = l + e,h = l (e > 0) we go through a discontinuous 




(a) (b) (c) 

FIG. 2: Regions in xy plane where acceptance probability a{x,y) < 1 or a{x,y) = 1, when (a) 
6(1, h,l) > and (b) 6(1, h, I) < 0. The equation for the boundary is shown in (c), see Eqn. [9]with 
k = 1.0. (The RWMH algorithm will only have regions described by (a).) 

transition where we see no convergence from any initial distribution. This can be understood 
by recognizing that after one step we will have a proposal density (before accept /reject) cx 
exp(— (1 + e)x^/2) which has a factor exp(— less probability in its tails than the target 



density. Suppose there is an initial distribution or point mass concentrated at x = v2M/ ^/e, 
M >> 1. The proposed step of the smart Monte Carlo algorithm, starting at x, will revisit 
X too infrequently by a factor exp(— M). Thus detailed balance will force the transition 
X to be accepted with a probability of only exp(— M), and thus the initial distribution 
will take an arbitrarily long time to converge to the target density. 

More formally, we can compute the probability of rejection, r(x) = 1 — J t{x,y)dy, when 
h, I are as above and we find, 



where b{k, h,l) < and $ is the cumulative normal (0, 1) distribution function. We note 
that ess sup r(x) = 1 by continuity of r(x), and then we use Proposition 5.1 from 3l| to 
conclude that the Markov chain is no longer geometrically convergent for these values of h 
and /. 

In fact this is only one of the two disconnected regions where no convergence is observed in 
Fig. [3l The largest of the two (with h > 1/2) is defined exactly by the equation 6(1, h,l) < 
(compare Fig. Wic) with Fig. [3] (a)). In this region the bound on the second eigenvalue 
approaches 1 as the variational parameter, a — 0. This corresponds to a perturbation on 
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(a) (b) 

FIG. 3: Estimate of second eigenvalue for the symmetrized smart Monte Carlo operator, (a) A; = 1 
is fixed and /i, I are allowed to vary, h = 1.0, ^ = 1.0 is the optimal scaling for deterministic and 
random parts of the step. The solid diagonal line is the parameter restrictions that yield the MALA 
algorithm (see text), (b) We take a slice through this surface at / = 1.5 and empirically determine 
the second eigenvalue at points along this curve (stars). The error bars are too small to be seen. 
Dashed lines are discontinuities. 

the target density of X\J 7r(x) for the unsymmetrized MCMC operator C. In other words, we 
have a test distribution that has exponentially more probability in its tails than the target 
density. For initial states x arbitrarily far away from the origin, the acceptance probability 
a{x,y) in the region \y\ < \x\ is arbitrarily small. To see this, note that Eqn. [10] is an 
exponentially decaying function of y"^ — x^ in this region, and given the form of the proposal 
density Eqn. [HI we see that the expected value of y^ — x^ is arbitrarily large and negative. 
Thus states far out will never be "allowed back" and the fat tails of \fK[x) will never shrink 
back down those of -7r(x). Furthermore, moves x y where \y\ > \x\ are always accepted 
(because a{x,y) = 1 on \y\ > \x\) which simultaneously prevents convergence. The situation 
is analogous to that described for / = 1 + e and h = k = 1, except now there is a cutoff 
both on the deterministic step and the random step. A typical example of this is shown in 
Fig. [H Once we cross to the 6(1, /i, /) > region, moves x — > y where \y\ < \x\ are always 
accepted by Eqn. [TO] (Fig. [2] (a)). Therefore excess probability in the tails is allowed to flow 
back into the central part of the distribution and the convergence is not blocked. 




FIG. 4: Forty iterates of the smart Monte Carlo algorithm (solid lines), Eqn. [HI when the initial 
distribution is normal with standard deviation five times the Gaussian target density (dashed line). 
Parameters are chosen to be in the region of no convergence {h = 2.0, / = 1.5), see Fig. [3] (a). We 
see that the tails of the initial distribution are essentially unchanging after many iterates and have 
failed to converge to the target density 



In the second region where no convergence is observed, {h < 1/2 in Fig. [3]), we have a 
situation where the deterministic step alone (taking / oo) leads to the proposed moves 
being generated by an unstable mapping, from the {n — l)*'^ to ra*^ iterate: x^^^ = x*^""^^ — 
^^(n-i) ^]2gj.g ^ > 2. The trial variational function for this situation also maximizes the 
bound as a ^ 0, again implying that the tails are not decaying to the stationary distribution. 
The reason is that, even when / < oo, we have a situation in which the expected or mean 
position of a state x after one step is y where \y\ > \x\. Thus excessive probability in the 
tails cannot be shifted inward to match the target density. 

The lack of convergence in this region was already noted for the Metropolis Adjusted 
Langevin Algorithm (MALA), a special case of the SMC algorithm where h = 21. As shown 



m 
is 



32| , if TT is bounded, a sufficient condition for MALA to fail to be geometrically ergodic 



limi,jl^'°f:'"'l>i 

|a::|->oo |a;| S 

where s is the single stepsize control parameter for that algorithm. The equivalence to the 



SMC method is established by setting / = 1/s. Thus, for the Gaussian target density vr, 
the condition is / < 1/4. Referring to the sohd white hne in Fig. [3t^a), the non-convergent 
parameter regime for MALA hes along the line segment h = 21 with / < 1/4 which matches 
exactly with the boundary we have determined using the variational method. 

The h = 1/2 "trough" is a special case where we have oscillatory behavior. That is, 
the second eigenvalue is negative but greater than —1 and in fact convergence does occur. 
Interestingly setting h = k/2 means that b{k,h,l) = k and the acceptance probability of 
Eqn. [To] looks again like that of the RWMH algorithm, but the convergence is actually 
faster. In a sense, given that the deterministic part of the step moves x — —x and the 
target distribution is symmetric, the oscillatory behavior allows the chain to sample the 
distribution twice as fast. 

B. Quartic target density 

In scientific or statistical applications where MCMC is used, the log of the target density 
will ordinarily have higher order terms beyond the quadratic order we studied in the previous 
section. For example, in a Bayesian inference problem the posterior distribution will rarely 
have a simple Gaussian form. Both finding the maximum a posteriori parameter estimates 
and sampling from the posterior are made more difficult in the presence of these higher order 
terms. 

Therefore, we wish to extend the previous example by studying a target distribution 
of the form 7i{x) = (2(3/^)A;(^/^Vr(l/4))exp(-A;a;V2). Here, the log of the target density 
is quartic and the proposal density (Gaussian) no longer has the same form as the target 
density. We would like to understand the performance of the Monte Carlo algorithms in this 
circumstance. (The test distribution is taken to be oc xexp {—ax'^/2), i.e. in the orthogonal 
space to the stationary distribution). 

The goal is to estimate the optimal value of /, as before. We can argue approximately that 
the step scale should be such that kx'^/2 ^ 1 for a typical move x, i.e. the change in energy 
is about 1 and the acceptance probability is therefore exp(— 1). This gives a typical value 
for = \f2l\fk. Since the proposal density is Gaussian with variance 1//, we therefore 
would naively predict / = \fkl \f2. Applying the variational method, we were unable to find 
a closed form solution to Eqn. [H] so we had to resort to numerical integrals in determining 



the bound in Eqn. [5l The results are shown in Fig. [5] for the RWMH method; it suggests 
an optimal choice for the step size parameter, /, which is an improvement over our initial 
guess of 1/a/2 ^ .71 (when k = 1). Using the formulas for the optimal step scale from 
coincidentally yields about .71, also a little off from our variational estimate. 
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FIG. 5: Second eigenvalue estimate from the variational method (solid line) and empirical estimates 
(stars), for the quartic target density {k = 1) using the RWMH method, Eqn. [71 The numerical 
values for A* are now estimated by taking the ratio of the discrepancy from the target density in 
subsequent iterates and finding a single multiplicative factor which describes the decay. This is 
done rather than using functional forms analogous to Hermite polynomials to fit the decay, because 
it appears that there may be more significant contributions from higher order terms. This also 
explains why the lower bound shown differs more than in Fig. [T]and Fig. [3] (b). The data point 
shown at / = l/\/2 ~ .71 (see text) does not appear to be optimal. 

Turning to the smart Monte Carlo algorithm, if we wish to make the deterministic part 
of the proposed move a Newton step using the Hessian of — log(7r(x)) at x = we are left 
with a singular Hessian and an infinite deterministic step, reinforcing the need for the step 
length control parameter, h. 

Surprisingly, we find that, independent of the value of h and /, {k fixed at 1), the scalar 
product (ya,Sva) — * 1 as a ^ 0. Thus there are no choices of scaling parameters which will 
lead to convergence. This is borne out by numerical simulation, see Fig. [6] for the changes 




in an initial density under many iterates of the algorithm with an arbitrary choice for s, h. 




FIG. 6: Forty iterates of the smart Monte Carlo algorithm (solid lines), Eqn. [SJ when the target 
density is quartic (dashed line) . The initial distribution of points is normal with standard deviation 
about five times that of target density (dashed line). Parameters are arbitrarily chosen as {h = 1.0, 
/ = 1.0), and we see that the tails of the initial distribution are unchanged for every iterate of the 
algorithm. Other parameter sets tested lead to the same behavior. 



The failure of the smart Monte Carlo method for the quartic problem is clearly due to 
non-convergence of the tails of the distribution, and can be seen by analyzing the integrals 
defining the operator, Eqn. [6l and noting that they all tend to zero as the variational 
parameter tends to zero, independent of the choice for k, h and I. 

As a partial check on this result, we again apply the condition derived in 32] for 
the MALA algorithm, which states that geometric convergence is not possible when 
liminf|^.|^oo |V log7r(x)|/|x| > 4/s where s is the step scale parameter. Now, for the quartic 
density vr, the quantity on the left of the inequality oo, so no value of s can give geometric 
convergence. MALA is a special case of the SMC algorithm, but we have shown here that 
the latter also has convergence problems indicated by A* 1, for all values of its scaling 
parameters, h and /. 

The Gaussian and quartic problems are representative examples of target densities on 
which we have tested the smart Monte Carlo method. As we have seen there are severe 



convergence problems on these distributions. We would expect that for real applications, 
where the log of the target density would contain components of these and higher order 
nonlinearities, similar convergence difficulties for the smart Monte Carlo method would 



occur. It may well be that in applications where the method is extensively used (e.g. 



33 



35j) the convergence criteria are less precise than ours. For example, it may be acceptable to 



merely monitor the variance of some function of the state space variables and conclude that 



convergence has been achieved when it ceases to change appreciably, or as in 
efficiency by the integrated autocorrelation time. 
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IV. MULTIDIMENSIONAL TARGET DENSITIES 



For multidimensional problems, it is quite common to find a large range of scales associ- 
ated with the target density [s], y, 2] . That is, the curvature of the probability density along 
some directions in the parameter space is much larger than in other directions. Clearly, if 
an MCMC method is not designed to take these different scales into account through im- 
portance sampling, the algorithm will perform very poorly. If the curvature is very high in 
a particular direction and we try to take a moderately sized step, it will almost certainly be 
rejected but if we take small steps in directions that are essentially flat the MCMC algorithm 
will be very slow to equilibrate. We would like to show explicitly here what happens to the 
convergence rate when the scale of the problem has been underestimated or overestimated. 

The variational calculations for the one dimensional examples of the previous section 
either yielded explicit formulas or gave integrals that were relatively fast to compute nu- 
merically. However as we go to multiple dimensions neither of these features are present, 
in general. Typically the integrals describing {va,Sva) will not factor into one dimensional 
integrals. For Gaussian target densities the full space is broken into regions analogous to 
those in Fig. [21 described by an equation like y^Ay > x*Ax where A is a symmetric n by 
n matrix which is not necessarily positive deflnite. For the RWMH algorithm applied to a 
multivariate Gaussian target density with inverse covariance matrix we have A = K, and 
therefore all the dimensions are coupled through the energy change, Ai? = y^Ky — x*Kx. 
We would still like to be able to get a lower bound on A*, and to this end note that any test 
function orthogonal to the target density will work in Eqn. the bound does not explicitly 
require a variational parameter, however without it the estimate will be less accurate. It is 



still necessary to make choices for the test functions that are both tractable in computing 
{v,Sv) and are "difficult" functions for the given algorithm to converge from, i.e. have a 
significant component along the true second eigenfunction. 

As an example, take the multivariate Gaussian distribution of the form 



with ), and consider using the MH algorithm with importance sampling, i.e. 



where again L is the inverse covariance matrix/step size control term and to simplify we 
assume that both K and L are diagonal. Without any analysis we might guess that the 
optimum choice for L is K. 

First we construct a test function that will provide a useful bound when the proposed 
steps are too large for the natural scale of the problem. For simplicity, consider putting a 
delta function distribution at the origin. If we take large steps the acceptance probability 
should be low and there will be a large overlap between the initial state and the final state. 
In the limit that the proposed steps have infinite length, the initial state will not be changed 
at all and the bound on the second eigenvalue in absolute value will approach one. To 
do this more carefully we define a test function which is a Gaussian whose variance will 
ultimately be taken to zero to represent the delta function. However, we also need to add 
another term to ensure the test function is orthogonal to the target density, in order to apply 
the variational bound. Therefore, for the unsymmetric operator we write the test function 
—A'k{x) + Bwcr{x) where Wcr{x) is the probability density for a multivariate 
Gaussian with covariance matrix cr^J and A and B are constants. For the symmetrized 
operator the trial function is transformed to Vfj{x) = —A^y 7r{x) + Bwa{x)/ \/ti{x). A and 
B are constrained to satisfy the orthogonality relation (t'o-, vr) = and a normalization 
{Vfj^Va) = 1. These lead to the conditions 




(11) 




A = B and 




Then it can be seen that 




where we have used the orthogonahty condition, the fact that Wcr{x) integrates to 1 and that 
S is self-adjoint. Writing out the operator S exphcitly we get 

S^,^] = [ ! ^s{x^y)^dxdy- ! [ t{x,y) ( dxdy 
^/^T{x) V7r(a;)y J J V7r(x) V'^iv) ^ ^ Vv^(^)/ 



dx 



The last term on the right hand side is (1 + -B^)/-B^, making use of the normalization 
condition, so we are left with 



iv„,Sv^) = B' [ [ ^^!^s{x,y)^^^dxdy-B'' [ [ t{x,y) ( ^^^0=] dxdy + l 
J J Jnix) \/Ti{y) J J \ \/ti{x) / 



A/vr(x) ' \/Tr{y) J J ' \^J'K{x) ^ 

Since we are ultimately taking a limit as cr ^ {wcr a delta function) we can make 
approximations to these integrals as follows : 




-s{x,y)^^^dxdy^s{Q,Q) I I 



dxdy 



and 

2 




tix,y) (^^^j dxdy^ J t{Q,y)dy j (^^^j dx . 
Finally by taking a — we have the expression 

(5^0,^0) = 1 - y" KO^y)dy ■ 

As already mentioned, for the multidimensional problem we expect different functional forms 
for the kernels s{x, y) and t{x, y) depending on the initial and final state (x, y) and this is 
what makes decoupling the integrals difficult. However for this choice of test function the 
equation for the boundary (with x = 0) is given by y^Ky = and since K is positive 
semidefinite we always stay on one side of the boundary (the energy never decreases from 
the initial distribution placed at x = 0). Then 

{Svo,vo) = 1 - J exp (^~y\K + L)y^ dy (12) 

where k and ki are the diagonal elements of the diagonal matrices L and K, respectively. 
With no importance sampling we would have L = kl where k would be chosen to make 



sufficiently large steps to enable it to sample 7r(x). A rough argument as follows can give 
some insight into the form of Eqn. [13] : 1 / is a measure of the scale in the z*'* coordinate 
direction of the proposal density, l/^fki is the scale in the i^^ coordinate direction of the 
target density. Suppose that /j ^ ki for each z, that is the scales of the proposal density 
are too large in all directions. Then the ratio of the mean volume of moves generated 
by q{0,y) to the volume occupied by 7i{y) is exactly YYi=i Vh/Vki- Intuitively, this ratio 
is proportional to the acceptance probability, and in the regime li <C ki the acceptance 
probability determines the convergence properties. 

We want to use Eqn. [13] to show how choosing step sizes too large even in one direction 
will result in a very inefficient algorithm. Suppose that for all but one of the directions we 
make li = ki, i = 1, . . . ,n — l which would be roughly the correct scaling in those directions. 
Then the bound on the second eigenvalue is 



From this we can see that as we go to larger and larger step sizes relative to the scale in 



that if one of the directions of the target density has a scale that is considerably smaller 
than the step scales being used in the proposal density, we will get very few acceptances 
and the convergence rate will be close to 0. Hence we see explicitly the need for importance 
sampling to accelerate convergence. 

We would also like to address what happens in the other limit as the step size becomes 
excessively small compared to the natural scale of the problem. (In fact Eqn. [13] gives a 
lower bound of zero in that case which is not surprising as it is based essentially on the term 
in the operator equation which gives the probability of staying at the current state. If we 
take infinitesimally small steps, the acceptance probability will be one and we will never stay 
at the current state). When the step scales are infinitesimally small we expect intuitively 
that the bound on the second eigenvalue will also approach one; even though the acceptance 
ratio is close to one, very small steps will never be able to "explore" the target distribution 
sufficiently. To compute this limit, we propose a test function which has components of the 
target density in all directions except the last, where it has an antisymmetric form to make 
sure it is orthogonal to the target density. With respect to the symmetrized operator S this 




(14) 



the last direction {kn/i. 



n 



oo), the bound on A* increases to 1. Conversely we can argue 



means 



i=l 



Here ^/^^i{x{) is the one dimensional Gaussian density which is the i^^ factor in a diagonal- 
ized multivariate Gaussian density. (Recall that since ti{x) is an eigenfunction of £, then 



7r(x) is an eigenfunction of S.) We still have the problem of decoupling the n-dimensional 
multivariate problem into n one dimensional problems. To manage this we use a device to 
re-express the operator equation , Eqn. [6], explicitly in terms of the change ^{y^Ky — x^Kx). 



(i.e. — log^M), which we denote by AE. That is 





{v,Sv) ~ I J v{x)s{x,y)v{y) dxdy — j j t{x,y){v{x)) dxdy + 1 
Xn'rr{x)q{x,y) 





x^TT{x)q{x,y) 



j min(e"^^, 1) 5 (^AE " ^ '^^^^^^ ~ j j ^^^^ 
y min(e-^^, 1) 6 (^AE " ^ '^iiVi " 2^*^) j j ^^^^ 



Then we use the integral representation of the delta function 6{x) = 2^ / exp(— iwx) dw, 
factor q{x, y) = Y[i=i Qii^i^ Vi)^ ^'^^ rearrange the order of integration to give : 

{v,Sv) = ^ j min(exp(-AE), 1) (^j A{w) e^^{-\w AE) d^u)^ dAE (16) 

where A{w) contains the integration over the now decoupled (x, y) coordinates : 

A{w) = (YIJ j ^i(^i)9i(3;i>?/i)exp Qiw/ci(|/f - x-)^ dxidy-] x (17) 




{xn.yn - a;„)7r„(a;n)gn(a;„, ?/„) exp ( -\wki{y^ - x„) ) dxn dy^ (18) 



^ 1 i^w 



TT — 



(19) 



, . + W))^ (1 + ^w(-i + W))t 

Note that the complex integral with respect to dw has a branch point at the roots of 
(1 + j^w{^—\ + 'u;))t which lie on the imaginary axis at and r2- It simplifies the analysis 
to consider the situation fcj = /j for z = 1, . . . , (n — 1) and assume that n — 1 is even. This 
way, the roots of (1 + w[—\ + w))^~, ri and r2,o, are (n — l)/2 order poles and not branch 
points, also on the imaginary axis. If we now also assume that kn < s„, then we can take 
a contour as shown in Fig. [7] when AE < and a similar one in the lower imaginary 
plane when AE > 0. Thus Eqn. [19] is reduced to a residue term and a real integral which 



Im 




FIG. 7: Contour used to evaluate Eqn. [19] when AE < 0. ri is a branch point and ri^o is a pole of 
order (n — l)/2. The contour is the same for AE < except restricted to the negative imaginary 
plane. 



needs to be evaluated numerically. The result is plotted for n = 11 in Fig. [H] along with 
the bound that came from Eqn. [TH Thus we see the trade off between taking large steps 
that potentially can explore the space quickly but have a higher chance of being rejected 
and taking small steps which will have a high acceptance probability but will be unable 
to sample the space quickly. As we saw when doing the full variational calculation for the 
one dimensional problems, the best step scale to use is not what we may have guessed; the 
natural choice = = 1 here does not appear to minimize the second eigenvalue. We 
believe this kind of "approximate" variational approach may be a useful way to deal with 
problems which are difficult to analyze otherwise. 



V. CONCLUSION 



By applying a variational method, it is possible to obtain an accurate (lower bound) 
estimate for the second eigenvalue of an MCMC operator and thus bound the asymptotic 
convergence rate of the chain to the target distribution. Given such an estimate we can 
optimally tune the parameters in the proposal distribution to improve the performance of 
the algorithm. The procedure has a role to play between the various numerical algorithms 
that perform convergence diagnostics before the full simulations are run, to allow the user 
to manually tune parameters, and the adaptive schemes j^, 36 1 that require no preliminary 



exploration. The simulations we performed to confirm our variational bounds in the case of 
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FIG. 8: Lower bound on second eigenvalue for the multivariate Gaussian problem, Eqn. \Tl\ with 
n = 11. Step scale = 1 //„ . A;„ = 1 sets the scale of the target density in the last direction. The 
test function is chosen as the negative of the target density perturbed by a delta function (solid 
line) or as the target density itself in all directions but the last (dashed line) . The estimate for the 
lower bound is a maximum of the two curves. 



a one dimensional target density and varying one step scale parameter, Fig. [T] and Fig. [3](b), 
would be infeasible to do as we move to higher dimensions and as we vary additional algo- 
rithm parameters. It is in those situations that the variational method can more efficiently 
identify regions of optimality. 

In addition, the variational method allows us to discover weaknesses in variants of the 
Random Walk Metropolis-Hastings algorithm which on the surface appear to be reasonable 
prescriptions for sampling the target density. This is most dramatically seen in the smart 
Monte Carlo method discussed above which apparently has serious flaws for even the sim- 



plest of one dimensional target densities. Alt 



used in molecular dynamics applications 



nou 
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gh the smart MC method has been widely 



35| the scales are often chosen by physical 



considerations (for example, to not exceed significantly the step sizes needed to accurately 
describe the dynamical evolution of the system) and furthermore, the diagnostics of conver- 
gence are not as rigorous as ours; typically a physical quantity is monitored till it appears 
to reach an equilibrium value, the rare events which correspond to the tails of the target 



distribution are possibly of lesser importance in those studies. Therefore the convergence 
problems we have discussed here specifically in relation to the smart Monte Carlo method, to 
our knowledge, have not been previously examined. Presumably the convergence problems 
can be corrected by a more careful discretization of the underlying diffusion equations, as 
was shown for the related Langevin-type methods [s^]. 

It would be interesting to app ly the same technique to the more broadly used gradient 



based hybrid MC algorithms 
tempering 



38| and other non-adaptive accelerated methods (e.g. parallel 



39| ) where the alternative techniques for determining convergence via diffusion 



approximations may be harder to apply. More generally, the variational analysis could be a 
useful tool in making comparisons between the convergence properties of the latest MCMC 
algorithms without extensive numerical simulation. 
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